The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in the United States are due to cardiovascular disease. Early prognosis of cardiovascular disease can aid decisions on lifestyle changes for high-risk patients and in turn reduce complications. This research aims to pinpoint the most relevant risk factors of heart disease. The data includes:
• Sex: male or female (Nominal)
• Age: age of the patient (Continuous)
• Current Smoker: whether or not the patient is a current smoker (Nominal)
• Cigs Per Day: the number of cigarettes the person smoked on average in one day (Continuous)
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)
• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
• Diabetes: whether or not the patient had diabetes (Nominal)
• Tot Chol: total cholesterol level (Continuous)
• Sys BP: systolic blood pressure (Continuous)
• Dia BP: diastolic blood pressure (Continuous)
• BMI: body mass index (Continuous)
• Heart Rate: heart rate (Continuous)
• Glucose: glucose level (Continuous)
• Ten Year CHD: 10-year risk of coronary heart disease (binary: “1” means “Yes”, “0” means “No”)
This assignment is structured as follows. In part 1, we present our data. In part 2, we tidy, explore, and describe the data so that it can be processed further. In part 3, we provide basic prediction models, supported by explanations, interpretations, and graphs. In part 4, we improve our models by making them more complex, accompanied by a discussion. In part 5, we draw conclusions about the predictions.
Let’s explore the format of each column.
str(data)
## 'data.frame': 4238 obs. of 16 variables:
## $ male : int 1 0 1 0 0 0 0 0 1 1 ...
## $ age : int 39 46 48 61 46 43 63 45 52 43 ...
## $ education : int 4 2 1 3 3 2 1 2 1 1 ...
## $ currentSmoker : int 0 0 1 1 1 0 0 1 0 1 ...
## $ cigsPerDay : int 0 0 20 30 23 0 0 20 0 30 ...
## $ BPMeds : int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentStroke: int 0 0 0 0 0 0 0 0 0 0 ...
## $ prevalentHyp : int 0 0 0 1 0 1 0 0 1 1 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ totChol : int 195 250 245 225 285 228 205 313 260 225 ...
## $ sysBP : num 106 121 128 150 130 ...
## $ diaBP : num 70 81 80 95 84 110 71 71 89 107 ...
## $ BMI : num 27 28.7 25.3 28.6 23.1 ...
## $ heartRate : int 80 95 75 65 85 77 60 79 76 93 ...
## $ glucose : int 77 76 70 103 85 99 85 78 79 88 ...
## $ TenYearCHD : int 0 0 0 1 0 0 1 0 0 0 ...
Not all columns were imported as the data types we expected, so we need to change the type of some variables.
#specifying factor variables
dataupd <- data %>%
mutate(male = as.factor(male),
currentSmoker = as.factor(currentSmoker),
education = as.factor(education),
BPMeds = as.factor(BPMeds),
prevalentStroke = as.factor(prevalentStroke),
prevalentHyp = as.factor(prevalentHyp),
diabetes = as.factor(diabetes),
TenYearCHD = as.factor(TenYearCHD))
Now the variables have more appropriate types.
With the factors specified correctly, we can explore the missing-data patterns:
md.pattern(dataupd)
## male age currentSmoker prevalentStroke prevalentHyp diabetes sysBP diaBP
## 3656 1 1 1 1 1 1 1 1
## 331 1 1 1 1 1 1 1 1
## 93 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1
## 51 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1
## 38 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## TenYearCHD heartRate BMI cigsPerDay totChol BPMeds education glucose
## 3656 1 1 1 1 1 1 1 1 0
## 331 1 1 1 1 1 1 1 0 1
## 93 1 1 1 1 1 1 0 1 1
## 8 1 1 1 1 1 1 0 0 2
## 51 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 1 0 1 0 2
## 9 1 1 1 1 0 1 1 1 1
## 38 1 1 1 1 0 1 1 0 2
## 1 1 1 1 1 0 1 0 1 2
## 1 1 1 1 1 0 0 1 0 3
## 23 1 1 1 0 1 1 1 1 1
## 4 1 1 1 0 1 1 1 0 2
## 2 1 1 1 0 1 1 0 1 2
## 13 1 1 0 1 1 1 1 1 1
## 4 1 1 0 1 1 1 1 0 2
## 1 1 1 0 1 1 1 0 1 2
## 1 1 1 0 1 0 1 1 0 3
## 1 1 0 1 1 1 1 1 1 1
## 0 1 19 29 50 53 105 388 645
The md.pattern output shows that there are not many missing values in this dataset. Glucose, education, and BPMeds are the variables most often missing (388, 105, and 53 cases, respectively). The missingness could arguably be ignored, since most observations are complete. We will, however, fill in the missing values by imputation; a complete-case version of the data is also kept for comparison.
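The per-column totals in the bottom row of md.pattern can also be checked with base R; a minimal sketch on a small made-up toy frame (the column names mirror the real data, but the values are invented):

```r
# Count missing values per column; illustrated on a made-up data frame
toy <- data.frame(
  glucose   = c(77, NA, 70, NA),
  education = c(4, 2, NA, 3),
  BPMeds    = c(0, 0, 0, 0)
)
colSums(is.na(toy))  # glucose 2, education 1, BPMeds 0
```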
# Note: deleting incomplete rows may be preferable to mean imputation, since
# mean imputation underestimates the uncertainty and does not preserve the
# variability in the data.
dataupd_nomissingvalues <- na.omit(dataupd)
imp <- mice(dataupd, seed = 123)
##
## iter imp variable
## 1 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
plot(imp)
imp$meth
## male age education currentSmoker cigsPerDay
## "" "" "polyreg" "" "pmm"
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## "logreg" "" "" "" "pmm"
## sysBP diaBP BMI heartRate glucose
## "" "" "pmm" "pmm" "pmm"
## TenYearCHD
## ""
ini <- mice(data = dataupd, pred=quickpred(dataupd, mincor=.1), m = 5, maxit = 20, print=F, seed = 123)
plot(ini)
ini$method
## male age education currentSmoker cigsPerDay
## "" "" "polyreg" "" "pmm"
## BPMeds prevalentStroke prevalentHyp diabetes totChol
## "logreg" "" "" "" "pmm"
## sysBP diaBP BMI heartRate glucose
## "" "" "pmm" "pmm" "pmm"
## TenYearCHD
## ""
stripplot(imp)
stripplot(ini)
# Impute again with default mice settings; the completed data replaces dataupd below
miceobj <- mice(dataupd)
##
## iter imp variable
## 1 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 1 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 2 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 3 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 4 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 1 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 2 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 3 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 4 education cigsPerDay BPMeds totChol BMI heartRate glucose
## 5 5 education cigsPerDay BPMeds totChol BMI heartRate glucose
dataupd <- complete(miceobj)
To simplify model interpretation, we rescale all numeric variables to the [0, 1] range (min-max normalisation).
#specify function
normalise <- function(x) (x-min(x))/(max(x)-min(x))
dataupd <- dataupd %>%
mutate_if(is.integer, as.numeric) %>%
mutate_if(is.numeric, normalise)
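As a quick sanity check, the min-max function maps the smallest value to 0 and the largest to 1 (illustrated on a made-up vector):

```r
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Endpoints map to 0 and 1, the midpoint to 0.5
normalise(c(40, 60, 80))  # -> 0.0 0.5 1.0
```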
# Alternative: z-score standardization
# Define a function to standardize (z-score normalize) a numeric vector
standardize <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
# Apply z-score standardization to all numeric columns in dataupd_nomissingvalues
dataupd_nomissingvalues <- dataupd_nomissingvalues %>%
mutate_if(is.numeric, standardize)
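After z-scoring, each numeric column has mean 0 and standard deviation 1; a minimal check on a made-up vector:

```r
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

z <- standardize(c(120, 130, 140, 150))
mean(z)  # -> 0 (up to floating-point error)
sd(z)    # -> 1
```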
mod1 <- glm(TenYearCHD ~ male + age + currentSmoker + diabetes + BMI + glucose, data = dataupd, family = "binomial")
summary(mod1)
##
## Call:
## glm(formula = TenYearCHD ~ male + age + currentSmoker + diabetes +
## BMI + glucose, family = "binomial", data = dataupd)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.43263 0.20982 -21.126 < 2e-16 ***
## male1 0.47891 0.09178 5.218 1.81e-07 ***
## age 2.94571 0.21134 13.938 < 2e-16 ***
## currentSmoker1 0.39687 0.09591 4.138 3.50e-05 ***
## diabetes1 -0.03314 0.30508 -0.109 0.9135
## BMI 1.41072 0.44694 3.156 0.0016 **
## glucose 3.37948 0.75547 4.473 7.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3611.5 on 4237 degrees of freedom
## Residual deviance: 3294.7 on 4231 degrees of freedom
## AIC: 3308.7
##
## Number of Fisher Scoring iterations: 5
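Because this is a logistic model, the coefficients are log-odds; exponentiating them gives odds ratios. A short illustration using the estimates reported in the summary above (note that the continuous predictors were min-max scaled, so their odds ratios refer to a change across the full observed range, not a one-unit change):

```r
# Odds ratio for male (coefficient taken from the summary above):
# men have roughly 1.6 times the odds of 10-year CHD
exp(0.47891)

# For a min-max scaled predictor such as age, this compares the oldest
# patient in the sample to the youngest, not a one-year difference
exp(2.94571)
```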
# Set seed for reproducibility
set.seed(123)
# Train 80% and test 20%
train_proportion <- 0.8
# Create an index for splitting the data
train_index <- createDataPartition(dataupd_nomissingvalues$TenYearCHD, p = train_proportion, list = FALSE)
# Split the data into training and test sets
train_data <- dataupd_nomissingvalues[train_index, ]
test_data <- dataupd_nomissingvalues[-train_index, ]
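createDataPartition from caret performs a stratified split that preserves the outcome's class proportions. For comparison, a plain (non-stratified) random 80/20 split can be done in base R; a sketch on a made-up outcome vector:

```r
set.seed(123)
y <- factor(rep(c(0, 1), times = c(85, 15)))  # made-up imbalanced outcome
n <- length(y)

# Sample 80% of row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_y <- y[train_idx]
test_y  <- y[-train_idx]
length(train_y)  # 80
length(test_y)   # 20
```

Unlike createDataPartition, this does not guarantee that both sets keep the same 85/15 class balance, which matters with rare outcomes such as TenYearCHD.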
# For the training dataset (train_data)
train_predictors <- train_data[, c("male", "age", "currentSmoker", "diabetes", "BMI", "glucose")]
train_target <- train_data$TenYearCHD
# For the test dataset (test_data)
test_predictors <- test_data[, c("male", "age", "currentSmoker", "diabetes", "BMI", "glucose")]
test_target <- test_data$TenYearCHD
# Cross-validation
cv <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Create the logistic regression model with cross-validation
model_logistic <- train(
x = train_predictors,
y = train_target,
method = "glm",
family = "binomial",
trControl = cv
)
# Print the model results
print(model_logistic)
## Generalized Linear Model
##
## 2926 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2341, 2341, 2341, 2341, 2340
## Resampling results:
##
## Accuracy Kappa
## 0.8503089 0.06084089
# Extract and print the coefficients
coef_summary <- summary(model_logistic$finalModel)
print(coef_summary)
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42326 0.10198 -23.763 < 2e-16 ***
## male1 0.55678 0.11190 4.976 6.49e-07 ***
## age 0.70216 0.05814 12.077 < 2e-16 ***
## currentSmoker1 0.46577 0.11670 3.991 6.57e-05 ***
## diabetes1 0.18369 0.35151 0.523 0.60126
## BMI 0.16289 0.05304 3.071 0.00213 **
## glucose 0.18378 0.05793 3.173 0.00151 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2498.2 on 2925 degrees of freedom
## Residual deviance: 2252.3 on 2919 degrees of freedom
## AIC: 2266.3
##
## Number of Fisher Scoring iterations: 5
# Make predictions on the test data using the trained logistic regression model
test_predictions <- predict(model_logistic, newdata = test_predictors)
# Confusion matrix test
test_confusion_matrix <- confusionMatrix(test_predictions, test_target)
# Confusion matrix on the test data
print(test_confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 615 108
## 1 4 3
##
## Accuracy : 0.8466
## 95% CI : (0.8184, 0.872)
## No Information Rate : 0.8479
## P-Value [Acc > NIR] : 0.566
##
## Kappa : 0.0334
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99354
## Specificity : 0.02703
## Pos Pred Value : 0.85062
## Neg Pred Value : 0.42857
## Prevalence : 0.84795
## Detection Rate : 0.84247
## Detection Prevalence : 0.99041
## Balanced Accuracy : 0.51028
##
## 'Positive' Class : 0
##
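To see where these statistics come from, sensitivity and specificity can be recomputed by hand from the confusion matrix above (positive class “0”):

```r
# Cell counts taken from the confusion matrix above
TP <- 615  # predicted 0, actually 0
FN <- 4    # predicted 1, actually 0
FP <- 108  # predicted 0, actually 1
TN <- 3    # predicted 1, actually 1

sensitivity <- TP / (TP + FN)  # 615/619 ~ 0.9935
specificity <- TN / (TN + FP)  # 3/111  ~ 0.0270
```

The model almost always predicts “no CHD”, so the high accuracy is driven by the majority class: it detects only 3 of the 111 actual CHD cases in the test set.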
# Cross-validation
cv <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Create the random forest model with cross-validation
model_rf <- train(
x = train_predictors,
y = train_target,
method = "rf", # Specify "rf" for random forest
trControl = cv
)
# Print the model results
print(model_rf)
## Random Forest
##
## 2926 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2340, 2341, 2341, 2341, 2341
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8458662 0.03814488
## 4 0.8318544 0.08688241
## 6 0.8277524 0.08435492
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# A random forest has no coefficients; summarise the components of the fitted object instead
coef_summary <- summary(model_rf$finalModel)
print(coef_summary)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 2926 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 5852 matrix numeric
## oob.times 2926 -none- numeric
## classes 2 -none- character
## importance 6 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 2926 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 6 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
# Make predictions on the test data using the trained random forest model
test_predictions_rf <- predict(model_rf, newdata = test_predictors)
# Confusion matrix random forest
test_confusion_matrix_rf <- confusionMatrix(test_predictions_rf, test_target)
# Print the confusion matrix for the random forest model on the test data
print(test_confusion_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 615 110
## 1 4 1
##
## Accuracy : 0.8438
## 95% CI : (0.8154, 0.8694)
## No Information Rate : 0.8479
## P-Value [Acc > NIR] : 0.6447
##
## Kappa : 0.0042
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.993538
## Specificity : 0.009009
## Pos Pred Value : 0.848276
## Neg Pred Value : 0.200000
## Prevalence : 0.847945
## Detection Rate : 0.842466
## Detection Prevalence : 0.993151
## Balanced Accuracy : 0.501273
##
## 'Positive' Class : 0
##
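As with the logistic model, the random forest's headline accuracy mostly reflects the class imbalance. Recomputing balanced accuracy from the matrix above shows the model is barely better than chance at separating the classes:

```r
# Cell counts taken from the random forest confusion matrix above
sens <- 615 / (615 + 4)   # sensitivity for class 0
spec <- 1 / (1 + 110)     # specificity for class 1

balanced_accuracy <- (sens + spec) / 2  # ~0.5013, barely above a coin flip
```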